Introduction

link here for airplane crash data

This dataset is over the airplane crashes and fatalities that have occured throughout the years since 1908. Many of these are unknown while others, are historical events such as 9/11. With this data, we will be able to gain a better understanding on how the type of aircraft is correlated to the number of crashes and fatalities.

Get Ready

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(readr)
library(ggplot2)
library(gtable)
library(grid)
library(extrafont)
## Registering fonts with R
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

We will read in data and store in a new variable

air <- read_csv("Airplane_Crashes_and_Fatalities_Since_1908.csv", show_col_types = FALSE)
glimpse(air)
## Rows: 5,268
## Columns: 13
## $ Date         <chr> "09/17/1908", "07/12/1912", "08/06/1913", "09/09/1913", "…
## $ Time         <chr> "17:18", "06:30", NA, "18:30", "10:30", "01:00", "15:20",…
## $ Location     <chr> "Fort Myer, Virginia", "AtlantiCity, New Jersey", "Victor…
## $ Operator     <chr> "Military - U.S. Army", "Military - U.S. Navy", "Private"…
## $ `Flight #`   <chr> NA, NA, "-", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Route        <chr> "Demonstration", "Test flight", NA, NA, NA, NA, NA, NA, N…
## $ Type         <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane", "Zep…
## $ Registration <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `cn/In`      <chr> "1", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Aboard       <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23, 21, …
## $ Fatalities   <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23, 21, …
## $ Ground       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Summary      <chr> "During a demonstration flight, a U.S. Army flyer flown b…

Data column descriptions

This is the dataset column descriptions:

Date: Date of accident, in the format – January 01, 2001

Time: Local time, in 24 hr. format unless otherwise specified

Location: Location of the accident

Operator: Airline or operator of the aircraft

Flight: Flight number assigned by the aircraft operator

Route: Complete or partial route flown prior to the accident

Type: Aircraft type

Registration: ICAO registration of the aircraft

cn/ln: Construction or serial number / Line or fuselage number

Aboard: Total aboard (passengers / crew)

Fatalities: Total fatalities aboard (passengers/crew)

Ground: Total killed on the ground

Summary: Brief description of the accident and cause if known

DATA PREPARATION

These are the missing values we had -The missing values in the time column accounts for 30% of the data in the column.

-The missing values in the flight column accounts for 74% of the data in the column.

-The missing values in the registration column accounts for 5% of the data in the column.

-The missing values in the cn/ln column accounts for 13% of the data in the column.

I decide to drop those 4 columns.

air = subset(air, select = -c(`Flight #` , Route, Registration, `cn/In`))

As we can see, the Date column is in the format Month Day, Year

Because I will analyze on separate month, separate day, and separate year, I will extract the years from the Data column.

First, I check the data type of the Date column. The Date column is in character.

class(air$Date)
## [1] "character"

I convert Date from character to class ‘date’. I extract year from the Date column

air$Date <- as.Date(air$Date, format = "%m/%d/%Y")
air$year <- as.numeric(format(air$Date, "%Y"))
air %>% mutate( year = air$year)
## # A tibble: 5,268 Ă— 10
##    Date       Time  Location     Operator Type  Aboard Fatalities Ground Summary
##    <date>     <chr> <chr>        <chr>    <chr>  <dbl>      <dbl>  <dbl> <chr>  
##  1 1908-09-17 17:18 Fort Myer, … Militar… Wrig…      2          1      0 During…
##  2 1912-07-12 06:30 AtlantiCity… Militar… Diri…      5          5      0 First …
##  3 1913-08-06 <NA>  Victoria, B… Private  Curt…      1          1      0 The fi…
##  4 1913-09-09 18:30 Over the No… Militar… Zepp…     20         14      0 The ai…
##  5 1913-10-17 10:30 Near Johann… Militar… Zepp…     30         30      0 Hydrog…
##  6 1915-03-05 01:00 Tienen, Bel… Militar… Zepp…     41         21      0 Crashe…
##  7 1915-09-03 15:20 Off Cuxhave… Militar… Zepp…     19         19      0 Explod…
##  8 1916-07-28 <NA>  Near Jambol… Militar… Schu…     20         20      0 Crashe…
##  9 1916-09-24 01:00 Billericay,… Militar… Zepp…     22         22      0 Shot d…
## 10 1916-10-01 23:45 Potters Bar… Militar… Zepp…     19         19      0 Shot d…
## # ℹ 5,258 more rows
## # ℹ 1 more variable: year <dbl>
class(air$year)
## [1] "numeric"

I did the same thing to extract day from the Date column.

air$Date <- as.Date(air$Date, format = "%m/%d/%Y")
air$day <- as.numeric(format(air$Date, "%d"))
air %>% mutate( day = air$day)
## # A tibble: 5,268 Ă— 11
##    Date       Time  Location     Operator Type  Aboard Fatalities Ground Summary
##    <date>     <chr> <chr>        <chr>    <chr>  <dbl>      <dbl>  <dbl> <chr>  
##  1 1908-09-17 17:18 Fort Myer, … Militar… Wrig…      2          1      0 During…
##  2 1912-07-12 06:30 AtlantiCity… Militar… Diri…      5          5      0 First …
##  3 1913-08-06 <NA>  Victoria, B… Private  Curt…      1          1      0 The fi…
##  4 1913-09-09 18:30 Over the No… Militar… Zepp…     20         14      0 The ai…
##  5 1913-10-17 10:30 Near Johann… Militar… Zepp…     30         30      0 Hydrog…
##  6 1915-03-05 01:00 Tienen, Bel… Militar… Zepp…     41         21      0 Crashe…
##  7 1915-09-03 15:20 Off Cuxhave… Militar… Zepp…     19         19      0 Explod…
##  8 1916-07-28 <NA>  Near Jambol… Militar… Schu…     20         20      0 Crashe…
##  9 1916-09-24 01:00 Billericay,… Militar… Zepp…     22         22      0 Shot d…
## 10 1916-10-01 23:45 Potters Bar… Militar… Zepp…     19         19      0 Shot d…
## # ℹ 5,258 more rows
## # ℹ 2 more variables: year <dbl>, day <dbl>

I use the same process to extract month from the Date column.

air$Date <- as.Date(air$Date, format = "%m/%d/%Y")
air$month <- as.numeric(format(air$Date, "%m"))
air %>% mutate( month = air$month)
## # A tibble: 5,268 Ă— 12
##    Date       Time  Location     Operator Type  Aboard Fatalities Ground Summary
##    <date>     <chr> <chr>        <chr>    <chr>  <dbl>      <dbl>  <dbl> <chr>  
##  1 1908-09-17 17:18 Fort Myer, … Militar… Wrig…      2          1      0 During…
##  2 1912-07-12 06:30 AtlantiCity… Militar… Diri…      5          5      0 First …
##  3 1913-08-06 <NA>  Victoria, B… Private  Curt…      1          1      0 The fi…
##  4 1913-09-09 18:30 Over the No… Militar… Zepp…     20         14      0 The ai…
##  5 1913-10-17 10:30 Near Johann… Militar… Zepp…     30         30      0 Hydrog…
##  6 1915-03-05 01:00 Tienen, Bel… Militar… Zepp…     41         21      0 Crashe…
##  7 1915-09-03 15:20 Off Cuxhave… Militar… Zepp…     19         19      0 Explod…
##  8 1916-07-28 <NA>  Near Jambol… Militar… Schu…     20         20      0 Crashe…
##  9 1916-09-24 01:00 Billericay,… Militar… Zepp…     22         22      0 Shot d…
## 10 1916-10-01 23:45 Potters Bar… Militar… Zepp…     19         19      0 Shot d…
## # ℹ 5,258 more rows
## # ℹ 3 more variables: year <dbl>, day <dbl>, month <dbl>

The Location column is in the format city/region/country. I decide to split the values in

the Location column so I can perform analysis in individual areas.

First, I use the str_split_fixed() function to split the column. For example: Victoria, British, Canada becomes |Victoria| and |British, Canada|.

air = air %>% mutate( country = str_split_fixed(air$Location,  ',' , 2)) 

Second, I use the gsub() function to erase the part before the comma in the second column that I just split. For example: British, Canada becomes Canada. Finally, I add that column to my data set and name it Country.

air = air %>% mutate( Country = gsub(".*\\," , "", air$country[, 2])) 

When I look at the number of fatalities and the number of passenger aboard, I think that I could obtain the number of survivals by subtracting the number of fatalities from number of passenger aboard. Therefore, I subtract the values in the Fatalities column from the values in the Aboard column and then add those new values to my data set and name it Survival.

air = air %>% mutate(Survival = Aboard - Fatalities)

It is EDA time !

I will have a quick comment on some of the metrics in the data set.

As we can see, the maximum and minimum values of the year column are 1908 and 2009. The crashes in our data are stated to range from 1908 and 2009 in the metadata. Therefore, the maximum and minimum values in the year column are correct.

summary(air$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1908    1954    1973    1971    1990    2009

Next, the third quartile of the ground column is 0. Therefore, we know that there are at least 75% of the crashes did not kill any people on the ground. The mean, however, is 3 people. The maximum value of the ground column is 2750 people. That indicates there is a crash that killed 2750 people on the ground. The value 2750 is very likely an outlier which affects the mean. This extreme outlier is why the mean is greater than the median and even greater than the third quartile. Overall, it is bad to know such a big number of people killed, but now we could know that most of the time when a plane crashes, and we are not in the plane, we are very safe.

summary(air$Ground)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.000    0.000    1.609    0.000 2750.000       22

Now let’s look at the statistics of number of passengers aboard and number of fatalities together! The minimum, median, maximum of the number of passengers aboard are 0, 16, 644 respectively. The minimum, median, maximum of the number of fatalities are 0, 10, 583 respectively. We can sort of see that the number of fatalities increases as the number of passenger aboard increases.

summary(air$Aboard)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   13.00   27.55   30.00  644.00      22
summary(air$Fatalities)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    9.00   20.07   23.00  583.00      12

Let’s look at number of passengers on board and the number of fatalities over years.

Looking at the graph, the number of fatalities seems to increase as the number of passengers increase. There seems to be a big gap between the 2 lines at year 1999.

a = air %>% group_by(year) %>% summarise( Fa = sum(Fatalities))

  
b = air %>% group_by(year) %>% summarise( A = sum(Aboard))

c = inner_join(a,b)
## Joining with `by = join_by(year)`
p1 = c %>% ggplot( aes( x= year))
p1 = p1 + geom_line(aes(y = Fa), color = "red")
p1 = p1 + geom_line(aes(y = A), color = "blue")

ggplotly(p1)

Let’s look at the the number of survivals and the number of fatalities by year

Based on the graph, the number of fatalities is almost always quite higher than the number of survivals. From 1908 to 1998, there are always more people killed than survived. Only in year 1999, the number of survivors was higher than the number of fatalities a little bit. After year 1999, however, the number of fatalities went back to being higher than the number of survivals again from 2000 to 2009. Therefore, I think that there is not any odd or long term factors in year 1999 that made the more people survived because the change is just a bit.

class(air$Fatalities) 
## [1] "numeric"
a = air %>% group_by(year) %>% summarise( Fa = sum(Fatalities))
b = air %>% group_by(year) %>% summarise( S = sum(Survival))
c = inner_join(a,b)
## Joining with `by = join_by(year)`
p1 = c %>% ggplot( aes( x= year))
p1 =p1 + geom_line(aes(y = Fa), color = "red")
p1 = p1 + geom_line(aes(y = S), color = "green")

ggplotly(p1)

Let’s look at the the number of survivals and the number of fatalities in year 1999 to see if my guess is right!

As we can see, year 1999 just happens to have 2 big flights on date 7/23 and 12/21 and luckily, the number of survivors on those two fligths are 516 and 298. In summary, year 1999 happens to have 2 big flights and both big flights happen to survive the almost all, bringing up the total number of survivals in that year to be bigger than the number of fatalities. After analysis on both graph, we can see that whenever a plane crashes, there is always tiny chance to survive.

a = air %>% filter(year == 1999) %>% group_by(Date) %>%summarise( S = sum(Survival))
b = air %>% filter(year == 1999) %>% group_by(Date) %>%summarise( Fa = sum(Fatalities))



c = inner_join(a,b)
## Joining with `by = join_by(Date)`
p1 = c %>% ggplot( aes( x= Date))
p1 = p1 + geom_line(aes(y = Fa), color = "red")
p1 = p1 + geom_line(aes(y = S), color = "green")

ggplotly(p1)

Let’s look at the number of crashes by year.

Based on the graph, the number of crashes increased tremendously from 1940 to 1946 and then stay in that range from 1947 to 1989.

The number of crashes in year 1972 was the highest.

After that, we can see a trend that the number of crashes starts to decrease from year 1994 to year 2009.

n = air %>% group_by(year) %>% summarise( accident = n()) %>% 
  ggplot() + aes(x = year, y = accident)  + geom_line() 
ggplotly(n)

Let’s look at the number of crashes by operator

Looking at the bar graph, the Aeroflot and Military - U.S. Air Force are two operators that have the most crashes

air %>% group_by(Operator) %>% summarise( accident = n()) %>% arrange(-accident) %>% head(10) %>% 
  ggplot() + aes(x = Operator, y = accident) + geom_col() + coord_flip()

I want to see what years Aeroflot crashed the most. Based on the graph, it seems like the 2 operators constantly had crashes from around 1957 to 1990

a = air %>% filter(Operator == 'Aeroflot') %>% group_by(year) %>% summarise( crash = n()) %>% arrange(-crash) %>% 
   ggplot() + aes(x = year, y = crash) + geom_col() + ggtitle('Years Aeroflot had the most crashes')
ggplotly(a)
m = air %>% filter(Operator == 'Military - U.S. Air Force') %>% group_by(year) %>% summarise( crash = n()) %>% arrange(-crash) %>% 
   ggplot() + aes(x = year, y = crash) + geom_col() + ggtitle('Years Military - U.S. Air Force had the most crashes')
ggplotly(m)

Let’s see what countries those 2 operators crashed

Looking at those 2 visualization, these 2 operators were probably used a lot during the Cold War. Although Aeroflot had the most crashes in Russia, it also had a lot of crashes when Russia was the USSR. The Aeroflot also crashed in East Germany, North Korea, Cuba, which were the satellite state of the USSR. As for the Military - U.S. Air Force, although the Operator belongs to the United State, it seems to crash a lot in foreign countries which I suppose there was a proxy war in those foreign countries.

a = air %>% filter(Operator == 'Aeroflot') %>% group_by(Country) %>% summarise( crash = n()) %>% arrange(-crash) %>% 
   ggplot() + aes( x = Country, y = crash) + geom_col() + ggtitle('Countries where Aeroflot had the most crashes') + 
  theme(axis.text.x = element_text(angle = 90))

ggplotly(a)
m = air %>% filter(Operator == 'Military - U.S. Air Force') %>% group_by(Country) %>% summarise( crash = n()) %>% arrange(-crash) %>% 
   ggplot() + aes(x = Country, y = crash) + geom_col() + ggtitle('Countries where Military - U.S. Air Force had the most crashes') + 
  theme(axis.text.x = element_text(angle = 90))

ggplotly(m)

Let’s look at the number of crashes by type

Based on the bar graph, Douglas DC-3 has the most crashes.

air %>% group_by(Type) %>% summarise( accident = n()) %>% arrange(-accident) %>% head(10) %>% 
  ggplot() + aes(x = Type, y = accident) + geom_col() + coord_flip()

Let’s see what years Douglas DC-3 crashed the most.

air %>% filter(Type == 'Douglas DC-3') %>% group_by(year) %>% summarise( Year = n()) %>% arrange(-Year) %>% 
  head(10) 
## # A tibble: 10 Ă— 2
##     year  Year
##    <dbl> <int>
##  1  1951    21
##  2  1950    20
##  3  1948    18
##  4  1954    17
##  5  1952    14
##  6  1953    14
##  7  1949    13
##  8  1957    13
##  9  1959    13
## 10  1947    11

Let’s look at the number of passenger killed aboard and the number of people killed on the grounds.

As we can see, the number of passenger killed aboard is alsmot always greater than the number of people killed on the grounds except for year 2001 in which the number of passenger killed on the ground is much greater than the number of people dead aboard

a = air %>% group_by(year) %>% summarise( Fa = sum(Fatalities))

  
b = air %>% group_by(year) %>% summarise( G = sum(Ground))

c = inner_join(a,b)
## Joining with `by = join_by(year)`
p1 = c %>% ggplot( aes( x= year))
p1 = p1 + geom_line(aes(y = Fa), color = "red")
p1 = p1 + geom_line(aes(y = G), color = "blue")
ggplotly(p1)

Let’s zoom in year 2001 and see the number of people killed on the Ground by month.

Based on the bar graph, September is where the number of all people killed on the ground in 2001 lie.

m = air %>% filter( year == 2001) %>% group_by(month) %>% summarise( ground = sum(Ground)) %>% 
  ggplot() + aes(x = month, y = ground)  + geom_col()  
ggplotly(m)

Let’s zoom in September and see what day the number of people killed on the ground lie. Looking at the graph, it lied on day 11.

d = air %>% filter( year == 2001, month == 9) %>% group_by(day) %>% summarise( ground = sum(Ground)) %>% 
  ggplot() + aes(x = day, y = ground)  + geom_col()  

ggplotly(d)

Now that we know the day and month, let’s zoom to September 11, 2001 to see what airplanes crashed on this day. Looking at the bar graph, we found out that the three planes Boeing 767-223ER, Boeing B-757-223, and Boeing B-767-222 are the planes that killed more people on the ground than their own passengers.

air %>% filter( year == 2001, month == 9, day == 11) %>% group_by(Type) %>% 
  ggplot() + aes(x = Type, y = Ground)  + geom_col()

Let’s look at the number of crashes by month. The number of crashes seems to be distributed quite equally among all months

air %>% group_by(month) %>% summarise( accident = n()) %>% 
  ggplot() + aes(x = month, y = accident)  + geom_col()   

Let’s look at the number of crashes by day. The number of crashes seems to be distributed quite equally among all days.

So no matter what month or day we choose to travel, there is no guarantee that a plane does not crash.

air %>% group_by(day) %>% summarise( accident = n()) %>% 
  ggplot() + aes(x = day, y = accident)  + geom_col()   

Let’s look at the number of crashes by countries.

Based on the graph, we could infer that the United States probably has the most crashes because 2 of its states Alaska and California lie in the top 10 countries that have the highest number of crashes. That is interesting to know when a state like Alaska has higher number of crashes than a whole country like Russia.

a = air %>% group_by(Country) %>% summarise( accident = n()) %>% arrange(-accident) %>% head(10) %>% 
  ggplot() + aes( x = Country, y = accident)  + geom_col()   

ggplotly(a)

SUMMARY

Based on the data, we can obtain the number of passengers and survivors, as well what type of aircraft they were on. We can also see which type of aircraft had the most and least crashes from 1908 to 2009. The limitations of this dataset is that it only contains the number of crashes and not the total number of flights throughout those years stated above. If we were to know the total number of flights we could obtain the ratio for them. This makes us wonder that it could be that we might have more crashes because there are more flights as more people tend to fly as the years progress. For example, in 2009 the dataset only included the total number of crashes and not the total flights that year.